Name: Divyanshu Singh, Asang Kumar & Jaspreet Singh
Group No: 86
Dataset: Water Potability Prediction Dataset
Water Potability Prediction Assignment 1 – PS-2
Access to safe drinking water is essential to health, a basic human right, and a component of effective policy for health protection. It is an important health and development issue at the national, regional and local levels. The objective of this assignment is to predict the potability of water from the given features. The water_potability.csv file contains water quality metrics for 3276 different water bodies.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve
from sklearn.model_selection import cross_validate, cross_val_score
df = pd.read_csv('../water_potability.csv')
df
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | 4.668102 | 193.681735 | 47580.991603 | 7.166639 | 359.948574 | 526.424171 | 13.894419 | 66.687695 | 4.435821 | 1 |
| 3272 | 7.808856 | 193.553212 | 17329.802160 | 8.061362 | NaN | 392.449580 | 19.903225 | NaN | 2.798243 | 1 |
| 3273 | 9.419510 | 175.762646 | 33155.578218 | 7.350233 | NaN | 432.044783 | 11.039070 | 69.845400 | 3.298875 | 1 |
| 3274 | 5.126763 | 230.603758 | 11983.869376 | 6.303357 | NaN | 402.883113 | 11.168946 | 77.488213 | 4.708658 | 1 |
| 3275 | 7.874671 | 195.102299 | 17404.177061 | 7.509306 | NaN | 327.459760 | 16.140368 | 78.698446 | 2.309149 | 1 |
3276 rows × 10 columns
# Print the first two rows
print(df.head(2))
        ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0      NaN  204.890455  20791.318981     7.300212  368.516441    564.308654
1  3.71608  129.422921  18630.057858     6.635246         NaN    592.885359

   Organic_carbon  Trihalomethanes  Turbidity  Potability
0       10.379783        86.990970   2.963135           0
1       15.180013        56.329076   4.500656           0
labels = ['Non-Potable', 'Potable']
data = [df['Potability'].value_counts()[0],
df['Potability'].value_counts()[1]
]
fig1, ax1 = plt.subplots(figsize=[15,6])
color = ['red', 'blue']
ax1.pie(data, labels=labels,explode=[0.05]*2, autopct='%1.1f%%',pctdistance=0.5, shadow=True, colors = color)
plt.title("Water Potability", fontsize=20);
plt.show()
df.Potability.value_counts()
0    1998
1    1278
Name: Potability, dtype: int64
Comment on class imbalance with appropriate visualization method.
The class distribution (1998 non-potable samples, about 61%, versus 1278 potable samples, about 39%) indicates a class imbalance in the target variable.
The majority class is the "non-potable" class, which constitutes 61% of the dataset, while the minority "potable" class makes up only 39%.
Class imbalance can impact the performance of machine learning models, particularly when the minority class is underrepresented. Models trained on imbalanced data tend to be biased towards the majority class, leading to poor predictive performance on the minority class.
Logically, we would prefer a model with more false negatives (potable water flagged as non-potable) rather than one with more false positives (non-potable water declared safe to drink), since the latter error is far more harmful.
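Besides the pie chart, a simple count plot is another appropriate way to visualize this imbalance. The snippet below is a minimal sketch using the same dataframe; it is not part of the original submission.

# Bar chart of class counts: an alternative view of the class imbalance
plt.figure(figsize=(6, 4))
sns.countplot(x='Potability', data=df)
plt.xticks([0, 1], ['Non-Potable (0)', 'Potable (1)'])
plt.title('Class Distribution of Potability')
plt.ylabel('Count')
plt.show()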
plt.style.use('seaborn-dark')
colors = ['#00a8e8', '#00afb9', '#48bfe3', '#006e90', '#20a4f3', '#00b4d8', '#0466c8', '#20a4f3', '#00008B', '#1E90FF']

# Plot the distribution of each feature along with its skewness
for i, col in enumerate(df.columns):
    # displot creates its own figure, so no extra plt.figure() call is needed
    sns.displot(df[col], color=colors[i], kde=True, height=4, aspect=1.5)
    plt.title(f'Distribution plot for {col}')
    plt.xlabel(f'Skewness = {round(df[col].skew(), 3)}', fontsize=14)
    plt.show()
Relationship Between Each Variable and Target Variable (Potability)
# Iterate over each feature column
for col in df.columns[:-1]:
    plt.figure(figsize=(8, 6))
    sns.boxplot(data=df, x="Potability", y=col)
    plt.title(f"Box Plot of {col} with Potability", size=10)
    plt.show()
sns.pairplot(df,hue='Potability')
<seaborn.axisgrid.PairGrid at 0x223e3435eb0>
Checking for missing/NaN values.
plt.title('Missing Values Per Feature')
nans = df.isna().sum().sort_values(ascending=False).to_frame()
sns.heatmap(nans,annot=True,fmt='d',cmap='vlag')
<AxesSubplot:title={'center':'Missing Values Per Feature'}>
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap="Reds")
plt.title("Correlations Between Variables", size=16)
plt.show()
corr_matrix = df.corr()
corr_matrix["Potability"].sort_values(ascending=False)
Potability         1.000000
Solids             0.033743
Chloramines        0.023779
Trihalomethanes    0.007130
Turbidity          0.001581
ph                -0.003556
Conductivity      -0.008128
Hardness          -0.013837
Sulfate           -0.023577
Organic_carbon    -0.030001
Name: Potability, dtype: float64
The correlation coefficient ranges from -1 to 1. When it is close to 1, there is a strong positive correlation; when it is close to -1, there is a strong negative correlation.
It appears that there is no meaningful linear correlation between the target and any of the features, so we cannot justify dropping any feature on this basis. Our initial impression is that this data cannot be modelled well by a simple classifier, as the relationships are highly tangled and non-linear. In other words, the correlation analysis does not affect our feature selection.
Observation:
Overall, the dataset is fairly small, so dropping rows with null/missing values would shrink it further, which we want to avoid. Instead, we will fill the NaN/missing values; techniques such as mean, median, or mode imputation can be applied with pandas' fillna().
We are going with the mean for the following reasons:
Mean imputation is often used when the missing values are numerical and the distribution of the variable is approximately normal.
Median imputation is preferred when the distribution is skewed, as the median is less sensitive to outliers than the mean.
Mode imputation is suitable for categorical variables or numerical variables with a small number of unique values.
Because the affected features in our dataset are approximately normally distributed, mean imputation is the better option here.
# Impute missing values in every column with that column's mean
df.fillna(df.mean(), inplace=True)
df.head()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | 333.775777 | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | 333.775777 | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
df.isnull().sum()
ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64
Dealing with outliers
What is an outlier? A value that lies far away from the rest of the data (far from the mean). A box plot gives us a better view of the outliers.
Outliers Using the Box Plot
df.boxplot(figsize=(14,7))
<AxesSubplot:>
The most prominent outliers are found in "Solids".
df.describe()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966786 | 0.390110 |
| std | 1.469956 | 32.879761 | 8768.570828 | 1.583085 | 36.142612 | 80.824064 | 3.308162 | 15.769881 | 0.780382 | 0.487849 |
| min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 | 0.000000 |
| 25% | 6.277673 | 176.850538 | 15666.690297 | 6.127421 | 317.094638 | 365.734414 | 12.065801 | 56.647656 | 3.439711 | 0.000000 |
| 50% | 7.080795 | 196.967627 | 20927.833607 | 7.130299 | 333.775777 | 421.884968 | 14.218338 | 66.396293 | 3.955028 | 0.000000 |
| 75% | 7.870050 | 216.667456 | 27332.762127 | 8.114887 | 350.385756 | 481.792304 | 16.557652 | 76.666609 | 4.500320 | 1.000000 |
| max | 14.000000 | 323.124000 | 61227.196008 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 | 1.000000 |
df['Solids'].describe()
count     3276.000000
mean     22014.092526
std       8768.570828
min        320.942611
25%      15666.690297
50%      20927.833607
75%      27332.762127
max      61227.196008
Name: Solids, dtype: float64
The maximum value of "Solids" is far above its 75th percentile, confirming the outliers seen in the box plot. For now we will leave the data as it is: Solids is an important attribute for deciding water potability, so these extreme values are retained.
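If we did want to treat the extreme "Solids" values, a standard 1.5×IQR rule could be used to flag them first. The snippet below is a hedged sketch of that check; we do not actually apply it in this assignment.

# Flag 'Solids' values outside 1.5*IQR of the quartiles (not applied to df)
q1, q3 = df['Solids'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
solids_outliers = df[(df['Solids'] < lower) | (df['Solids'] > upper)]
print(f"IQR bounds for Solids: [{lower:.1f}, {upper:.1f}]")
print(f"Number of flagged rows: {len(solids_outliers)}")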
Skewness Correction
Skewness of a distribution is defined as its lack of symmetry. In a symmetrical distribution the mean, median and mode are equal; the normal distribution has a skewness of 0.
Skewness is of two types:
Positive skewness: when the tail on the right side of the distribution is longer or fatter, the data is positively skewed. For positive skewness, mean > median > mode.
Negative skewness: when the tail on the left side of the distribution is longer or fatter, the data is negatively skewed. For negative skewness, mean < median < mode.
The rule of thumb is:
If the skewness is between -0.5 and +0.5, the data is fairly symmetrical.
If the skewness is between -1 and -0.5 or between +0.5 and +1, the data is moderately skewed.
If the skewness is less than -1 or greater than +1, the data is heavily skewed.
A quick automated check of this rule is sketched below.
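As a quick check of the rule of thumb (a small hypothetical helper, not part of the original notebook), each feature's skewness can be bucketed automatically:

def skew_category(s):
    """Classify a skewness value using the rule of thumb above."""
    if -0.5 <= s <= 0.5:
        return 'fairly symmetrical'
    elif -1 <= s < -0.5 or 0.5 < s <= 1:
        return 'moderately skewed'
    else:
        return 'heavily skewed'

for feature, s in df.skew().items():
    print(f"{feature}: {s:.3f} ({skew_category(s)})")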
# Calculate skewness for each feature
skewness = df.skew()
# Print the skewness values
for feature, skew in skewness.items():
    print(f"Skewness of {feature}: {skew}")
# Determine the number of rows and columns for subplots
n_features = len(df.columns)
n_rows = int(n_features / 3) + (n_features % 3 > 0)
n_cols = 3
# Visualize the transformed distributions with normal distribution curve
plt.figure(figsize=(15, 10))
for i, feature in enumerate(df.columns):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df[feature], kde=True)
    plt.xlabel(feature)
    plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
Skewness of ph: 0.027795881574044657
Skewness of Hardness: -0.0393417047767249
Skewness of Solids: 0.6216344855169127
Skewness of Chloramines: -0.01209844012438221
Skewness of Sulfate: -0.04118437258768767
Skewness of Conductivity: 0.2644902239264119
Skewness of Organic_carbon: 0.025532582090242492
Skewness of Trihalomethanes: -0.08516101983758483
Skewness of Turbidity: -0.007816642377392604
Skewness of Potability: 0.4507837491078628
Applying the rule of thumb to the values above, almost every feature is fairly symmetrical (only Solids, at about 0.62, is moderately skewed). That is the reason we do not apply any skew correction.
In the context of the given problem, feature engineering techniques such as standardization and normalization can be applied to preprocess the data and improve the performance of machine learning models.
1. Standardization (Z-score normalization):
Justification: Standardization is a popular technique that transforms the data to have zero mean and unit variance. It brings all features to a similar scale, which can be beneficial for models that are sensitive to the scale of the features, such as linear regression, logistic regression, and support vector machines.
Usage: Standardization is typically applied when the features have different scales and units, and when the distribution of the data is approximately Gaussian, or at least not heavily skewed. We are going with it because our features span very different ranges (e.g., Solids is in the tens of thousands while Turbidity stays below 10) and their distributions are roughly normal.
# Select the features to be standardized
features_to_standardize = ['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity']
# Perform standardization
scaler = StandardScaler()
data_standardized = scaler.fit_transform(df[features_to_standardize])
data_standardized = pd.DataFrame(data_standardized, columns=features_to_standardize)
# Visualize the standardized distributions
plt.figure(figsize=(12, 6))
for i, feature in enumerate(data_standardized.columns):
    plt.subplot(3, 3, i+1)
    sns.histplot(data_standardized[feature], kde=True)
    plt.xlabel(feature + " (Standardized)")
    plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
df.head()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | 333.775777 | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | 333.775777 | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
df.describe()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966786 | 0.390110 |
| std | 1.469956 | 32.879761 | 8768.570828 | 1.583085 | 36.142612 | 80.824064 | 3.308162 | 15.769881 | 0.780382 | 0.487849 |
| min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 | 0.000000 |
| 25% | 6.277673 | 176.850538 | 15666.690297 | 6.127421 | 317.094638 | 365.734414 | 12.065801 | 56.647656 | 3.439711 | 0.000000 |
| 50% | 7.080795 | 196.967627 | 20927.833607 | 7.130299 | 333.775777 | 421.884968 | 14.218338 | 66.396293 | 3.955028 | 0.000000 |
| 75% | 7.870050 | 216.667456 | 27332.762127 | 8.114887 | 350.385756 | 481.792304 | 16.557652 | 76.666609 | 4.500320 | 1.000000 |
| max | 14.000000 | 323.124000 | 61227.196008 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 | 1.000000 |
Standardization transforms the data based on two statistical parameters, the mean and the standard deviation. (Note that the standardized values live in data_standardized; df itself, shown in the describe() above, is left on its original scale.)
1. Mean:
Before standardization: The mean of the original data represents the average value of the feature. It can be positive, negative, or zero.
After standardization: The mean of the standardized data becomes zero. This means that the distribution is centered around zero.
2. Standard Deviation:
Before standardization: The standard deviation of the original data measures the spread or variability of the feature values.
After standardization: The standard deviation of the standardized data becomes 1. This ensures that the spread of the data points is similar and facilitates comparison across different features.
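As a quick sanity check (a minimal sketch; its output is not reproduced here), the column means of data_standardized should come out approximately 0 and the standard deviations approximately 1:

# Means should be ~0 and standard deviations ~1 after StandardScaler
print(data_standardized.mean().round(3))
print(data_standardized.std().round(3))

# Note: in a stricter workflow the scaler would be fit on the training
# split only and then applied to the test split, to avoid data leakage.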
Splitting data based on:
Dataset Size: The size of the dataset plays a role in determining the split ratio. We have enough data points for a 70:30 split to be a reasonable approach. We also tested the model with an 80:20 split and with 5-fold cross-validation and found that the results are quite similar, so we are going with the standard 70:30 split (a cross-validation sketch follows below).
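The 5-fold check mentioned above can be reproduced roughly as follows; this is a hedged sketch, as the exact cell is not included in this notebook:

# 5-fold cross-validated accuracy for a logistic regression baseline
X_all = df.drop('Potability', axis=1)
y_all = df['Potability']
cv_scores = cross_val_score(LogisticRegression(solver='liblinear'),
                            X_all, y_all, cv=5, scoring='accuracy')
print("5-fold accuracies:", cv_scores.round(3))
print("Mean accuracy:", cv_scores.mean().round(3))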
70:30 Train-Test Split Scores & Logistic Regression
# Separate the features and target variable
X = df.drop('Potability', axis=1)
y = df['Potability']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Print the sizes of the training and test sets
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])
Training set size: 2293
Test set size: 983
# Define the list of C values to be evaluated
C_values = [1, 0.5, 0.1, 0.01, 0.003]
# Perform model development and evaluation
best_score = 0
best_C = None
best_penalty = None
for penalty in ['l1', 'l2']:
    for C in C_values:
        # Build the Logistic Regression model
        lr = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        lr.fit(X_train, y_train)
        # Evaluate the model
        score = lr.score(X_test, y_test)
        # Check if the current model is the best so far
        if score > best_score:
            best_score = score
            best_C = C
            best_penalty = penalty
# Print the best parameter and its corresponding accuracy score
print("Best Penalty:", best_penalty)
print("Best C:", best_C)
print("Accuracy:", best_score)
Best Penalty: l1
Best C: 1
Accuracy: 0.6276703967446592
The choice of the best parameter is based on the highest accuracy score achieved on the test set. In this case, the model with L1 penalty and C=1 yielded the highest accuracy among all the combinations evaluated.
L1 penalty (Lasso regularization) encourages sparse solutions by shrinking less important features towards zero, effectively performing feature selection. It can be useful when there are many features and some of them are irrelevant or have little impact on the target variable.
The chosen value of C, which represents the inverse of regularization strength, was 1. A larger value of C indicates weaker regularization, allowing the model to focus more on fitting the training data.
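One way to see the feature-selection effect of the L1 penalty is to inspect the fitted coefficients: features driven to exactly zero have effectively been dropped. This is a hypothetical check (output not shown), not part of the graded cells:

# Coefficients of the best L1-regularized model; zeros indicate dropped features
best_lr = LogisticRegression(penalty='l1', C=1, solver='liblinear')
best_lr.fit(X_train, y_train)
for name, coef in zip(X_train.columns, best_lr.coef_[0]):
    print(f"{name}: {coef:.4f}")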
# Separate the features and target variable
X = df.drop('Potability', axis=1)
y = df['Potability']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Build the Logistic Regression model with the best parameters
lr = LogisticRegression(penalty='l1', C=1, solver='liblinear')
lr.fit(X_train, y_train)
# Perform predictions on the test data
y_pred = lr.predict(X_test)
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)
# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("ROC AUC:", roc_auc)
# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.show()
# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
Accuracy: 0.6276703967446592
Precision: 0.0
Recall: 0.0
F1-Score: 0.0
ROC AUC: 0.5
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
In this case, the model did not predict any instance as positive (potable water) in the test set, so there are zero true positives and zero false positives, and every actually potable sample becomes a false negative. This is why precision, recall, and F1-score for the positive class are all 0, while accuracy simply reflects the share of non-potable samples.
Accuracy: The accuracy score from the 70:30 train-test split is approximately 0.628, indicating that the model correctly predicts the class label for around 63% of the instances in the test set.
Precision: The precision score from the 70:30 split is 0, indicating that the model did not predict any positive instances correctly in the test set.
Recall: The recall score from the 70:30 split is 0, suggesting that the model failed to identify any positive instances in the test set.
F1-Score: The F1-score from the 70:30 split is 0, reflecting the lack of both precision and recall for the positive class.
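One possible remedy, consistent with the earlier standardization and class-imbalance discussion, would be to scale the features and re-weight the classes before fitting. The snippet below is a hedged sketch of such a variant, not part of the graded submission:

from sklearn.pipeline import make_pipeline

# Logistic regression on standardized features with balanced class weights
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(penalty='l1', C=1, solver='liblinear',
                                        class_weight='balanced'))
pipe.fit(X_train, y_train)
y_pred_bal = pipe.predict(X_test)
print(f"Recall (potable): {recall_score(y_test, y_pred_bal):.3f}")
print(f"F1 (potable): {f1_score(y_test, y_pred_bal):.3f}")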
print("Accuracy on training set: {:.3f}".format(lr.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(lr.score(X_test, y_test)))
Accuracy on training set: 0.602
Accuracy on test set: 0.628
From the accuracy figures above, the training and test scores are similar, but both are low and close to the majority-class baseline, which points to underfitting rather than a "just right" fit.
The model is unable to capture the relationship between the predictors and the target, and thus gives a poor training score.